Integrated single-cell dissection of tumor epithelial states, programs. This notebook assembles all analyses and visualizations underlying Figure 3, providing a comprehensive single-cell view of epithelial cell-state diversity, pathway activity, genotype-dependent composition, and lineage trajectories in colorectal cancer models.
The figure is structured into the following analytical layers:
Global epithelial landscape (Fig. 3a–c)
UMAP embeddings are used to visualize:
Tumor state composition (Fig. 3d)
Stacked barplots quantify the relative abundance of epithelial
states:
hash.ID), ordered by tumor
model.Compositional heterogeneity across models (Fig.
3e)
PCA of per-sample epithelial state proportions reveals genotype-driven
structure in tumor ecosystem composition, visualized in PC1–PC2 space
and colored by tumor model.
Gene-program activity across states (Fig.
3f)
A curated panel of oncogenic, developmental, inflammatory, and metabolic
gene sets is scored using AddModuleScore.
Pearson correlations between signature activity and cluster membership
are computed and summarized as a bubble plot, highlighting pathway–state
associations.
Together, these analyses integrate cell-state annotation, functional gene-program activity, compositional remodeling, and developmental trajectories to define how distinct oncogenic genotypes and microenvironmental programs shape epithelial plasticity and tumor evolution.
Description:
This section initializes the analysis environment by loading all R
packages required for downstream single-cell visualization and analysis.
Seurat is used for handling and manipulating the
single-cell object, dplyr and ggplot2 for data
wrangling and plotting, while SCpubr and
Nebulosa provide publication-quality UMAP and density
visualizations.
The preprocessed tumor epithelial Seurat object is then read from
disk and stored as tumcells. This object contains all
expression data, dimensionality reductions (e.g. UMAP), and metadata
(e.g. tumor states, stages, module scores) used in all subsequent Figure
3 panels.
Inputs: - R packages: Seurat,
dplyr, ggplot2, SCpubr,
Nebulosa - Serialized Seurat object file:
tumorcells.rds
Output: - tumcells: Seurat object
containing the tumor epithelial single-cell dataset used for all
downstream analyses and visualizations.
# Suppress package startup messages for cleaner notebook output
suppressPackageStartupMessages({
library(Seurat) # single-cell object handling and analysis
library(dplyr) # data manipulation
library(ggplot2) # plotting
library(SCpubr) # publication-style Seurat visualizations
library(Nebulosa) # kernel density feature plots on embeddings
})
# Load tumor epithelial Seurat object
tumcells <- readRDS("tumorcells.rds")Figure 3a–c – Global visualization of epithelial heterogeneity and molecular subtypes
Description:
This section presents three complementary UMAP visualizations of the
tumor epithelial compartment to summarize transcriptional heterogeneity,
pathological progression, and molecular subtype programs:
Panel 3a (Tumor epithelial states):
Cells are colored by curated transcriptional state annotations
(tumcells_annotation), highlighting major epithelial
programs such as WNT-modulating, crypt progenitor, basal/stress,
cycling, EMT, squamous-like, neuroendocrine, and regenerative states. A
fixed, publication-quality color palette (tumor_cols) is
used to ensure consistent state identity across all downstream
figures.
Panel 3b (Tumor stage):
The same UMAP embedding is colored by histopathological stage
(tum_stage), distinguishing normal tissue, adenoma, and
carcinoma. This view links transcriptional state distributions to
disease progression.
Panel 3c (iCMS programs):
Cells are classified as iCMS2-like or iCMS3-like based
on the relative dominance of the corresponding module scores (e.g.,
iCMS21 vs iCMS31). Each cell is assigned to
the program with the higher score, and the resulting binary
classification is visualized on the UMAP to reveal spatial segregation
and overlap of consensus molecular subtype–like programs within the
epithelial compartment.
Together, these panels provide a global overview of epithelial cell-state diversity, its relationship to tumor stage, and its alignment with human CRC iCMS transcriptional programs.
Inputs expected: - tumcells Seurat
object with: - UMAP embedding - tumcells_annotation
(epithelial state labels) - tum_stage (Normal tissue /
Adenoma / Carcinoma) - Module scores iCMS21 and
iCMS31
Outputs: - UMAP colored by epithelial tumor states
(Figure 3a)
- UMAP colored by tumor stage (Figure 3b)
- UMAP colored by iCMS2-like vs iCMS3-like assignment (Figure 3c)
## ============================================================
## Figure 3a – Tumor epithelial cell states on UMAP
## ============================================================
# Define color palette for curated tumor cell state annotations
tumor_cols <- c(
"WNT-modulating epithelium" = "#E76F51",
"Crypt progenitor" = "#2A9D8F",
"Basal/stress epithelium" = "#F4A261",
"Cycling G2/M" = "#264653",
"Immune regulatory epithelium" = "#8AB17D",
"Squamous-like epithelium" = "#5C53A5",
"EMT epithelium" = "#D264B6",
"Metabolic absorptive epithelium" = "#6D597A",
"Goblet-like epithelium" = "#4CC9F0",
"IFN-responsive epithelium" = "#FF6D00",
"Neuroendocrine epithelium" = "#577590",
"Absorptive enterocyte-like epithelium" = "#F9C74F",
"Ductal-like regenerative epithelium" = "#90BE6D",
"Ductal-metaplastic epithelium" = "#277DA1"
)
# UMAP colored by tumor epithelial state annotation
DimPlot(
tumcells,
cols = tumor_cols,
group.by = "tumcells_annotation"
)## ============================================================
## Figure 3b – Tumor stage on UMAP
## ============================================================
# Color palette for histopathological stage
stage_cols <- c(
"Carcinoma" = "#C94959", # red
"Adenoma" = "#F1C232", # yellow
"Normal tissue" = "#3FA7B5" # teal
)
# UMAP colored by tumor stage with publication-style formatting
p_stage <- DimPlot(
tumcells,
reduction = "umap",
group.by = "tum_stage",
pt.size = 0.2,
cols = stage_cols
) +
labs(title = "Tumour stage") +
theme_classic(base_size = 14) +
theme(
legend.title = element_text(size = 16, face = "bold"),
legend.text = element_text(size = 14),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_blank(),
axis.ticks = element_blank()
)
p_stage## ============================================================
## Figure 3c – iCMS2-like vs iCMS3-like assignment on UMAP
## ============================================================
# Assign iCMS class based on dominant module score per cell
tumcells$iCMS_assignment <- ifelse(
tumcells$iCMS21 >= tumcells$iCMS31,
"iCMS2_like",
"iCMS3_like"
)
# Enforce consistent factor level ordering
tumcells$iCMS_assignment <- factor(
tumcells$iCMS_assignment,
levels = c("iCMS2_like", "iCMS3_like")
)
# Sanity check of class distribution
table(tumcells$iCMS_assignment)
# UMAP visualization of iCMS class assignment
DimPlot(
tumcells,
group.by = "iCMS_assignment",
cols = c("steelblue", "firebrick")
)Figure 3d – Stacked barplots of epithelial state proportions
Description:
This section quantifies and visualizes the relative abundance of curated
tumor epithelial cell states (as defined in
tumcells_annotation) using stacked barplots. Two
complementary summaries of tumor-state composition are generated:
Per-sample composition (by
hash.ID):
For each multiplexed sample, the proportion of cells belonging to each
epithelial state is calculated and displayed as a stacked bar. The
x-axis is ordered by tumor_model, allowing comparison of
state composition across individual samples while retaining genotype
context.
Per–tumor model composition (by
tumor_model):
All cells belonging to the same tumor model are pooled, and state
proportions are computed in a cell-weighted manner. The resulting
stacked barplot summarizes the average epithelial-state landscape
characteristic of each genetic model.
In both representations, a consistent ordering of epithelial states
(annot_levels) is enforced and a fixed color palette
(tumor_cols) is applied. This ensures direct visual
comparability of state distributions across samples, tumor models, and
downstream panels (e.g., PCA in Figure 3e).
Inputs expected: - tumcells Seurat
object with: - tumcells_annotation (epithelial state
labels) - hash.ID (sample identifier) -
tumor_model (genotype / model label) -
annot_levels: ordered vector of epithelial states -
tumor_cols: named color palette for epithelial states -
custom_order: desired ordering of tumor models on the
x-axis
Outputs: - p_hash_states: stacked
barplot of epithelial-state proportions per sample (hash.ID)
- p_model_states: stacked barplot of epithelial-state
proportions per tumor model
## ============================================================
## Figure 3d – Proportion plots of tumor epithelial states
## ============================================================
## ============================================================
## 0) Setup
## ============================================================
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
## ============================================================
## 1) Define tumor-state order (recommended)
## Ensures consistent ordering of stacks/legends across plots.
## ============================================================
annot_levels <- c(
"WNT-modulating epithelium",
"Crypt progenitor",
"Basal/stress epithelium",
"Cycling G2/M",
"Immune regulatory epithelium",
"Squamous-like epithelium",
"EMT epithelium",
"Metabolic absorptive epithelium",
"Goblet-like epithelium",
"IFN-responsive epithelium",
"Neuroendocrine epithelium",
"Absorptive enterocyte-like epithelium",
"Ductal-like regenerative epithelium",
"Ductal-metaplastic epithelium"
)
# Enforce tumor_annotation factor order in meta.data
# (creates/overwrites md$tumor_annotation from tumcells_annotation)
md <- tumcells@meta.data
md$tumor_annotation <- md$tumcells_annotation
md$tumor_annotation <- factor(md$tumor_annotation, levels = annot_levels)
tumcells@meta.data <- md
## ============================================================
## 2) Proportions PER SAMPLE (hash.ID)
## Calculates within-sample fractions of each tumor_state.
## ============================================================
df_prop_state_hash <- data.frame(
hash.ID = tumcells$hash.ID,
tumor_state = tumcells$tumor_annotation,
tumor_model = tumcells$tumor_model
) %>%
filter(!is.na(hash.ID), !is.na(tumor_state), !is.na(tumor_model)) %>%
group_by(hash.ID, tumor_model, tumor_state) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(prop = n / sum(n)) %>%
ungroup()
# Ensure tumor_state ordering is retained in downstream plotting
df_prop_state_hash$tumor_state <- factor(df_prop_state_hash$tumor_state, levels = annot_levels)
## ============================================================
## 3) Enforce tumor_model ordering + derive hash.ID order for x-axis
## Ensures samples are grouped and ordered by tumor_model.
## ============================================================
custom_order <- c(
"Colon-WT",
"VA","VAKP","VAKPS","VBP","VBPN","VBPNA","VBPNC","VKP","VKPN",
"AKP-BFP","AKPS-BFP","AKP-STAR","AKP-Arid1a-STAR","AKP-Atm-STAR",
"AKP-Atrx-STAR","AKP-Fbxw7-STAR","AKP-Smad4-STAR","AKPPten-STAR","AKP-Ptprt-STAR"
)
# Keep models that exist, then append any unexpected models at the end
present_levels <- intersect(custom_order, unique(df_prop_state_hash$tumor_model))
present_levels <- c(present_levels, setdiff(unique(df_prop_state_hash$tumor_model), present_levels))
df_prop_state_hash$tumor_model <- factor(df_prop_state_hash$tumor_model, levels = present_levels)
# Order hash.ID by tumor_model, then within each model alphabetically by hash.ID
sample_order <- df_prop_state_hash %>%
distinct(hash.ID, tumor_model) %>%
arrange(tumor_model, hash.ID) %>%
pull(hash.ID)
df_prop_state_hash$hash.ID <- factor(df_prop_state_hash$hash.ID, levels = sample_order)
## ============================================================
## 4) Stacked barplot per hash.ID (x-axis labels show tumor_model)
## ============================================================
# Tumor-state color palette (must match Figure 3a colors)
tumor_cols <- c(
"WNT-modulating epithelium" = "#E76F51",
"Crypt progenitor" = "#2A9D8F",
"Basal/stress epithelium" = "#F4A261",
"Cycling G2/M" = "#264653",
"Immune regulatory epithelium" = "#8AB17D",
"Squamous-like epithelium" = "#5C53A5",
"EMT epithelium" = "#D264B6",
"Metabolic absorptive epithelium" = "#6D597A",
"Goblet-like epithelium" = "#4CC9F0",
"IFN-responsive epithelium" = "#FF6D00",
"Neuroendocrine epithelium" = "#577590",
"Absorptive enterocyte-like epithelium" = "#F9C74F",
"Ductal-like regenerative epithelium" = "#90BE6D",
"Ductal-metaplastic epithelium" = "#277DA1"
)
# Map hash.ID → tumor_model for x-axis labels (labels are "encrypted strings" = model IDs)
id_to_model <- df_prop_state_hash %>%
distinct(hash.ID, tumor_model) %>%
deframe()
p_hash_states <- ggplot(df_prop_state_hash,
aes(x = hash.ID, y = prop, fill = tumor_state)) +
geom_col(width = 0.8, color = "black", size = 0.3) +
scale_y_continuous(
breaks = c(0, 0.5, 1),
labels = function(x) x * 100,
expand = expansion(mult = c(0, 0))
) +
scale_fill_manual(
values = tumor_cols,
breaks = annot_levels,
limits = annot_levels,
drop = FALSE
) +
scale_x_discrete(labels = id_to_model) +
labs(x = "tumor_model", y = "Tumor cell states (%)") +
theme_classic(base_size = 7) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
legend.title = element_blank()
)
p_hash_states## ============================================================
## Figure 3d (alternate view) – Same composition summarized per tumor_model
## ============================================================
## ============================================================
## 0) Setup
## ============================================================
library(Seurat)
library(dplyr)
library(tidyr)
library(ggplot2)
## ============================================================
## 1) Define tumor-state order (recommended)
## ============================================================
annot_levels <- c(
"WNT-modulating epithelium",
"Crypt progenitor",
"Basal/stress epithelium",
"Cycling G2/M",
"Immune regulatory epithelium",
"Squamous-like epithelium",
"EMT epithelium",
"Metabolic absorptive epithelium",
"Goblet-like epithelium",
"IFN-responsive epithelium",
"Neuroendocrine epithelium",
"Absorptive enterocyte-like epithelium",
"Ductal-like regenerative epithelium",
"Ductal-metaplastic epithelium"
)
# Re-enforce tumor_annotation factor levels (safety)
md <- tumcells@meta.data
md$tumor_annotation <- factor(md$tumor_annotation, levels = annot_levels)
tumcells@meta.data <- md
## ============================================================
## 2) Enforce tumor_model ordering (your tradition)
## ============================================================
custom_order <- c(
"Colon-WT",
"VA","VAKP","VAKPS","VBP","VBPN","VBPNA","VBPNC","VKP","VKPN",
"AKP-BFP","AKPS-BFP","AKP-STAR","AKP-Arid1a-STAR","AKP-Atm-STAR",
"AKP-Atrx-STAR","AKP-Fbxw7-STAR","AKP-Smad4-STAR","AKPPten-STAR","AKP-Ptprt-STAR"
)
## ============================================================
## 3) Proportions PER TUMOR MODEL (tumor_model)
## Cell-weighted composition pooled across all samples in a model.
## ============================================================
df_prop_state_model <- data.frame(
tumor_model = tumcells$tumor_model,
tumor_state = tumcells$tumor_annotation
) %>%
filter(!is.na(tumor_model), !is.na(tumor_state)) %>%
mutate(
tumor_model = as.character(tumor_model),
tumor_state = factor(tumor_state, levels = annot_levels)
) %>%
group_by(tumor_model, tumor_state) %>%
summarise(n = n(), .groups = "drop_last") %>%
mutate(prop = n / sum(n)) %>%
ungroup()
# Order tumor_model by custom_order, append any unexpected models at the end
present_levels <- intersect(custom_order, unique(df_prop_state_model$tumor_model))
present_levels <- c(present_levels, setdiff(unique(df_prop_state_model$tumor_model), present_levels))
df_prop_state_model$tumor_model <- factor(df_prop_state_model$tumor_model, levels = present_levels)
## ============================================================
## 4) Colors for tumor states
## ============================================================
tumor_cols <- c(
"WNT-modulating epithelium" = "#E76F51",
"Crypt progenitor" = "#2A9D8F",
"Basal/stress epithelium" = "#F4A261",
"Cycling G2/M" = "#264653",
"Immune regulatory epithelium" = "#8AB17D",
"Squamous-like epithelium" = "#5C53A5",
"EMT epithelium" = "#D264B6",
"Metabolic absorptive epithelium" = "#6D597A",
"Goblet-like epithelium" = "#4CC9F0",
"IFN-responsive epithelium" = "#FF6D00",
"Neuroendocrine epithelium" = "#577590",
"Absorptive enterocyte-like epithelium" = "#F9C74F",
"Ductal-like regenerative epithelium" = "#90BE6D",
"Ductal-metaplastic epithelium" = "#277DA1"
)
## ============================================================
## 5) Stacked barplot per tumor_model
## ============================================================
p_model_states <- ggplot(df_prop_state_model,
aes(x = tumor_model, y = prop, fill = tumor_state)) +
geom_col(width = 0.8, color = "black", size = 0.3) +
scale_y_continuous(
breaks = c(0, 0.5, 1),
labels = function(x) x * 100,
expand = expansion(mult = c(0, 0))
) +
scale_fill_manual(
values = tumor_cols,
breaks = annot_levels,
limits = annot_levels,
drop = FALSE
) +
labs(x = "tumor_model", y = "Tumor cell states (%)") +
theme_classic(base_size = 7) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
legend.title = element_blank()
)
p_model_statesFigure 3e – Principal component analysis of tumor-state proportions per sample
Description:
This section performs principal component analysis (PCA) on the
composition of tumor epithelial states at the per-sample level (defined
by hash.ID). Starting from the long-format proportion table
(df_prop_state_hash, generated in Figure 3d), the data are
reshaped into a wide matrix (samples × tumor states), where each entry
represents the fraction of a given epithelial state within a sample. PCA
is then computed on this composition matrix using
prcomp.
Samples are visualized in the PC1–PC2 space and colored by
tumor_model. Axis labels report the percentage of variance
explained by the first two principal components, enabling interpretation
of major axes of variation in epithelial-state composition across
genotypes and disease stages.
Inputs expected: - df_prop_state_hash
(from Figure 3d), containing: - hash.ID (sample
identifier)
- tumor_state (epithelial state annotation)
- prop (within-sample proportion)
- tumor_model (genotype / model label) -
custom_order (from Figure 3d), used to enforce consistent
ordering of tumor models and color assignment
Output: - p_pca_model_states (ggplot
object):
A PCA scatter plot showing samples in PC1–PC2 space, colored by tumor
model, summarizing global differences in epithelial-state composition
across tumors.
## ============================================================
## Figure 3e – PCA on tumor-state composition per sample (hash.ID)
## ============================================================
## ============================================================
## 5) PCA on tumor-state composition per sample
## - Build composition matrix (samples x tumor states)
## - Run PCA on per-sample state proportions
## ============================================================
comp_mat_state <- df_prop_state_hash %>%
select(hash.ID, tumor_state, prop) %>%
pivot_wider(
names_from = tumor_state,
values_from = prop,
values_fill = 0
)
# Convert to numeric matrix for PCA (rows = samples; cols = tumor states)
comp_mat_state_mat <- as.matrix(comp_mat_state[, -1, drop = FALSE])
rownames(comp_mat_state_mat) <- comp_mat_state$hash.ID
# PCA on composition (no scaling; composition already on comparable scale)
pca_state <- prcomp(comp_mat_state_mat, scale. = FALSE)
# Variance explained by principal components (for axis labels)
pcVar_state <- summary(pca_state)
varPC1_state <- pcVar_state$importance[2, 1]
varPC2_state <- pcVar_state$importance[2, 2]
# Extract PC1/PC2 coordinates per sample
mat_state <- as.data.frame(pca_state$x[, 1:2, drop = FALSE])
colnames(mat_state)[1:2] <- c("PC1", "PC2")
mat_state$hash.ID <- rownames(mat_state)
# Annotate PCA coordinates with tumor_model (one model per hash.ID)
anno_state <- df_prop_state_hash %>%
distinct(hash.ID, tumor_model)
mat_state <- left_join(mat_state, anno_state, by = "hash.ID")
# Enforce tumor_model ordering for consistent legend ordering across figures
present_levels_pca <- intersect(custom_order, unique(mat_state$tumor_model))
present_levels_pca <- c(present_levels_pca, setdiff(unique(mat_state$tumor_model), present_levels_pca))
mat_state$tumor_model <- factor(mat_state$tumor_model, levels = present_levels_pca)
## ============================================================
## 6) Color palette for models
## - Assign distinct colors to each tumor_model level present
## ============================================================
pal_models_final <- setNames(
c(
"#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD",
"#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF",
"#AEC7E8", "#FFBB78", "#98DF8A", "#FF9896", "#C5B0D5",
"#C49C94", "#F7B6D2", "#C7C7C7", "#DBDB8D", "#9EDAE5",
"#393B79", "#8C6D31"
)[seq_along(levels(mat_state$tumor_model))],
levels(mat_state$tumor_model)
)
## ============================================================
## 7) PCA plot
## - PC1 vs PC2 colored by tumor_model
## ============================================================
p_pca_model_states <- ggplot(mat_state, aes(x = PC1, y = PC2, color = tumor_model)) +
geom_point(size = 5, alpha = 0.92) +
coord_equal() +
scale_color_manual(values = pal_models_final) +
xlab(sprintf("PC1 (%.1f%% variance explained)", varPC1_state * 100)) +
ylab(sprintf("PC2 (%.1f%% variance explained)", varPC2_state * 100)) +
ggtitle("PCA of samples by tumor-cell-state composition") +
guides(color = guide_legend(ncol = 2, override.aes = list(size = 4))) +
theme_classic(base_size = 10) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(face = "bold"),
legend.title = element_blank(),
legend.position = "right"
)
p_pca_model_statesFigure 3f – Correlation dot plot linking gene-set activity to tumor cell clusters
Description:
This section (i) defines a curated panel of epithelial/tumor programs
together with selected Hallmark and GO gene sets, (ii) computes module
scores for each gene set in tumcells using
AddModuleScore, (iii) calculates Pearson correlations
between each signature score and cluster membership (one-vs-all binary
vectors derived from seurat_clusters), and (iv) visualizes
the resulting signature × cluster correlation matrix as a bubble (dot)
plot, in which both color and size encode the correlation
coefficient.
To reduce redundancy and improve interpretability, highly similar or
overlapping signatures are collapsed using a “best cluster / best
correlation” strategy, and signatures are ordered according to their
maximal association with specific clusters.
Inputs expected: - tumcells Seurat
object containing: - seurat_clusters (cluster identities) -
tumcells_annotation (cell-state labels) in
@meta.data - Gene-set definitions (epithelial programs,
Hallmark, and GO sets) - A helper function base_name_fun()
used to collapse duplicate or closely related signatures (as referenced
in the code)
Output: - A signature × cluster correlation bubble plot (Figure 3f), with dot color and size proportional to the Pearson correlation coefficient.
## ============================================================
## Figure 3f – Correlation dot plot of gene sets vs. tumor cell-state clusters
## ============================================================
## Description:
## This chunk (i) defines curated gene sets, (ii) computes per-cell module scores
## using AddModuleScore, (iii) correlates each gene-set score with one-vs-rest
## cluster membership, (iv) collapses duplicate/related gene sets and orders them
## by their best-associated cluster, and (v) visualizes the correlation landscape
## as a bubble grid (size + color encode correlation).
##
## Inputs expected in `tumcells@meta.data`:
## - seurat_clusters
## - tumcells_annotation
## - expression rownames matching gene symbols in gene sets
##
## Output:
## - p : ggplot object (bubble plot)
## ============================================================
## ============================================================
## 0) Packages
## ============================================================
library(Seurat)
library(ggplot2)
library(scales) # for rescale()
## ============================================================
## 0.1) Helper: collapse signature names to "base" pathway
## (removes trailing "_cl<number>" if present)
## ============================================================
base_name_fun <- function(x) gsub("_cl[0-9]+$", "", x)
## ============================================================
## 1) DEFINE GENE SETS
## (original signatures + selected Hallmark/GO sets)
## ============================================================
## --- original tumor programs --------------------------------
oncofetal_genes <- c(
"Syt8","Vsig1","Serpinb5","Samd5","4930461G14Rik","Ly6a","Anxa1",
"Cdkn2a","Anxa3","S100a14","Ddah1","Ahnak","S100a6","Emp1","Bok",
"Ly6f","Gsn","Fmnl2","Mgat4c","Cxcl16","Pmepa1","Krt23","Rbms1",
"Tinagl1","Lmna","Anxa5","S100a11","Ier3","Avpi1","Krt7","Itgb4",
"Krt18","Cnn2","Gpx2","Emp2","Ccnd1","Ly6d","Kcnn4","Car2","Ctsh",
"Cd44","Fabp5","Tubb6","Col18a1","Col4a2","Timp2","Col4a1","Msln",
"Efemp1","1110028F11Rik","Serpinb11"
)
EpiHR_mouse <- c(
"Ngrn","Fbxl13","Itga3","Slc25a25","Sema4b","Mapk3","Rin1","Pnpla2",
"Rhod","Mid1ip1","Ddr1","Lamb3","Flrt1","Prss22","Lamc2","Hoxa4",
"Fam83a","Exoc7","Tnfrsf21","Ripk4","Prph","Krt19","Colec10","Spint1",
"Vgll1","Znf185","Gjb3","Paep","Rimbp2","Wfdc3","Gas6","Htr2c","Mroh6",
"Capn12","Fbxl14","Magi2","Anxa3","Ppl","Baiap2","Lingo1","Klk7","Epha2",
"Krt17","Scgn","Klk5","Lama3","Eps8l1","Fxyd4","Snx24","Slc22a18","Klk8",
"Kcnn4","Rasal1","Lemd1","Kcnk3","Rnf39","Hist2h2be","Lmo7","C1orf116",
"Psors1c1","Gprc5a","Sez6l2","C19orf33","Gdpd3","Klk10","Slc20a1","Klk11",
"Klk6","Scel","Kcnj16","Krt80","Camk2n1","Krt6a","Disp2","Slc2a1","Scnn1a",
"Tm4sf4","Msln","Col17a1","Fam155a","Krt6b","Pla2g10","Mybpc1","Sfta2",
"Slpi","Vsig2","Rhbdl2","Serpinb5","S100a2","Mmp7","Ctse","Nxf3","Gsta2",
"Gsta1","Clstn2","Hoxb8","Gabrp","Tacstd2","Dusp27"
)
Ki67_mouse <- c(
"Cdkn1b","Pif1","Ccna1","Gtse1","Cep55","Kif20b","Cenpe","Kif18a","Bub1",
"Cdc25b","Gjb3","Knstrn","Cdca3","Kif18b","Melk","Troap","Ddias","Ckap2",
"Kif20a","Mis18bp1","Ccnb2","Cdc20","Ckap2l","Ccnb1","Crym","Tpx2",
"Dlgap5","Bora","Nek2","Cdkn3","Aspm","Psrc1","Kif23","Pcsk6","Oscp1",
"Slc28a3","Nrarp","Cenpa","Alpp","Shcbp1","Arhgef39","Anln","Hmmr",
"Parpbp","Tlr1","Cdc25c","Ccnf","Aurka","Prc1","Cenpf","Gsdmc","Spc25",
"Kif2c","Kif14","Efcab11","Prr11","Kif22","Gsdmc","Plk1"
)
YAP_mouse <- c(
"Cyr61","Ctgf","Amotl2","Ankrd1","Igfbp3","F3","Fjx1","Nuak2","Lats2","Crim1",
"Gadd45a","Tgfb2","Ptpn14","Nt5e","Foxf2","Axl","Dock5","Asap1","Rbms3",
"Myof","Arhgef17","Ccdc80"
)
coreHRC_mouse <- c(
"Lamb3","Sdcbp2","Emp1","Eps8l1","Jup","Lmo7","Myof","Itgb4","Gprc5a","Ahnak",
"Camk2n1","Myh14","Cd55","Rhoc","Tmbim1","Sema3b","Tinagl1","Misp","Col17a1",
"Plec","Tspan1","Ptprh","Crb3","Itga3","Riok3","Lamc2","F2r","Rras","Muc13",
"Ceacam1","Tmprss2","Serinc2","Slc44a4","Pcdh1","Usp53","Flnb","Cxcl16","Ezr",
"Tnfrsf21","Cda","Adam9","Rnf103","Krt80","Efnb1","Smad3","Pdzkiip1","Lrp10",
"Prss22","Ahnak2","Anxa11","F11r","Dusp5","Ddr1","Itgb1","Slc2a1","Cdkn2b",
"Plaur","Itga2","Cdcp1","Cd59a","Dsg2","Klk10","Cd68","Cgn","Tm4sf1","Ppp2cb",
"Hebp2","Ugcg","Gbp3","Timp2","Pink1","Cystm1","Slc6a8","Ppard","Liph","Clic3",
"Spint1","Ptk6","Tjp3","Stk24","Ceacam5","Mxd1","Snx9","Ero1l","Ssf2","Tspan3",
"Slpi","Bmp2","Gjb3","Lama3","Tmprss4","B4galnt3"
)
fetal_mustata <- c(
"Spp1","Gja1","Basp1","Tacstd2","Sftpd","Mal","Clu","Wfdc2","Msln","Vsig1",
"Tubb6","Crip2","Anxa1","Col4a1","Il33","Col4a2","Krt4","Igfbp7","Akap13",
"Ecscr","Gkn1","Cdh13","Slc45a3","Efna5","Anxa8","Acsm3","Ppic","Anxa3",
"Eda2r","S100a6","Efemp1","Ddit4l","Pdzk1ip1","Ly6d","Laptm5","Plekhs1"
)
lgr5_genes_mouse <- c("Bcl11b","Axin2","Lgr5","Ascl2","Lrig1")
wnt_on_genes_mouse <- c(
"Ascl2","Axin2","Lgr5","Kiaa1199","Sp5","Cachd1","Cst1","Smoc2","Fam216a","Lef1",
"Cyp4x1","Slc12a2","Dpep1","Nkd1","Lrp4","Znrf3","Ptpro","Apcdd1","Tnfrsf19",
"Sox2","Ptch1","Tspan5","Myrip","Cdk6","Itga9","H19","Ppp2r2c","Igfl2"
)
KRASup_mouse <- c(
"Abcb1a","Abcb1b","Ace","Adam17","Adam8","Adamdec1","Adgra2","Adgrl4","Akap12","Akt2"
)
## --- selected Hallmark / GO gene sets from your GSEA ----------
GS_HALLMARK_WNT_BETA_CATENIN_SIGNALING <- c(
"Wnt6","Nkd1","Lef1","Gnai1","Tcf7","Axin2","Wnt5b","Hdac11",
"Ptch1","Csnk1e","Hdac5"
)
GS_HALLMARK_TNFA_SIGNALING_VIA_NFKB <- c(
"F3","Edn1","Cxcl5","Plaur","Fosl1","Cd44","Sphk1","Phlda2","Rel","Pmepa1",
"Tnfaip2","Tnfaip8","Tnfaip3","F2rl1","Tnf","Icam1","Clcf1","Ccn1","Plk2",
"Areg","Sat1","Atp2b1","Ackr3","Trib1","Lamb3","Smad3","Sdc4","Cxcl2",
"Hbegf","Ier3","Il23a","Pfkfb3","Tnip1","Csf1","Zc3h12a","Rela","Rcan1",
"Inhba","Klf6","Nfkbia","Tiparp","Ehd1","Birc3","Nfkb1","Litaf","Efna1",
"Nfkb2","Ifngr2","Ptger4","Relb","Dusp1","Tgif1","Bcl6","Cdkn1a","Tank",
"Eif1","Sqstm1","Gch1","Cxcl1","B4galt1"
)
GS_HALLMARK_INFLAMMATORY_RESPONSE <- c(
"F3","Edn1","Cxcl5","Plaur","Sgms2","Sphk1","Cd55","Icam1","Tnfsf10","Atp2b1",
"Pvr","Osmr","Slc4a4","Met","P2ry2","Hbegf","Csf1","Rela","Inhba","Klf6",
"Nfkbia","Nfkb1","Ifngr2","Abi1","Ptger4","Il4ra","Slc31a1","Rhog","Ahr",
"Cdkn1a","Gch1","Irak2"
)
GS_HALLMARK_KRAS_SIGNALING_UP <- c(
"Cidea","Wnt7a","Spp1","Ngf","Igfbp3","Aldh1a3","Angptl4","Cpe","Spon1",
"Emp1","Gprc5b","Rbp4","Ptprr","Traf1","Hdac9","Il33","Ets1","Itga2",
"Inhba","Plau","Plaur","Ano1","Kcnn4","Slpi","Etv4","St6gal1","Tnnt2",
"Laptm5","Nrp1","Dcbld2","Ank","Prkg2","Plat","Arg1","Btc","Tmem176a",
"Tspan7","Ammecr1","Tmem176b","Galnt3","Sox9","F2rl1"
)
GS_HALLMARK_TNFA_SIGNALING_VIA_NFKB_ALT <- c(
"Tnfaip2","Fjx1","Clcf1","Ccnd1","Jag1","Cd44","Sphk1","Traf1","Inhba",
"Plau","Plaur","Sdc4","Cxcl1","Pmepa1","Plpp3","Cebpb","Ier3","Atp2b1",
"Tiparp","Cxcl5","Tsc22d1","Dusp1","Il23a","Klf2","F2rl1","Fosl1","Tnfsf9",
"Smad3","Trip10","Areg","B4galt1","Tnfaip8","Ackr3","Vegfa","Dusp5","Hbegf",
"Slc16a6","Cxcl2","Litaf","Id2","Eif1","Klf6","Klf9","Sik1","Sat1"
)
GS_HALLMARK_INTERFERON_GAMMA_RESPONSE <- c(
"Ifit1bl1","Ifit3","Ccl7","Oas3","Ifi27l2b","Irf7","Cd38","Batf2","Ifi44",
"Ddx60","Sp110","Oas2","Mettl7b","Isg15","Il7","Dhx58","Oasl1","Zbp1",
"Ube2l6","Gbp3","Casp7","Casp1","Upp1","Rsad2","Ifih1","Rtp4","Ifit2","Mvp",
"Tap1","Parp12","Psmb9","Parp14","Xaf1","Psmb10","Cfb","Sri","Sppl2a",
"Cmpk2","Hif1a","Samd9l","Txnip","Trim25","Epsti1","Herc6","Isg20","Pde4b",
"Psmb8","Lgals3bp","Gch1","Trafd1","Fas","Nmi","Stat1","Tdrd7","Pla2g4a",
"Ifi30","Eif4e3","Tnfaip3","Znfx1","H2-D1","Plscr1","Il15ra","Pfkp","Pnp2",
"Rnf213","H2-Aa","Ifi35","Casp8","Cdkn1a","Stat2","Samhd1","Helz2","Il4ra",
"B2m","Eif2ak2","Ifi27","Bst2"
)
GS_HALLMARK_INTERFERON_ALPHA_RESPONSE <- c(
"Ifit3","Ifi27l2b","Irf7","Batf2","Ifi44","Ddx60","Sp110","Oas1g","Gbp2",
"Isg15","Il7","Oas1a","Tmem140","Dhx58","Oasl1","Ube2l6","Gbp3","Casp1",
"Rsad2","Ifih1","Rtp4","Ifit2","Tap1","Parp12","Psmb9","Parp14","Trim12c",
"Cmpk2","Samd9l","Txnip","Trim25","Epsti1","Herc6","Uba7","Isg20","Psmb8",
"Lgals3bp","Parp9","Trafd1","Nmi","Tdrd7","Ifi30","Plscr1","Cnp","Ifi35",
"Casp8","Stat2","Helz2","Il4ra","B2m","Eif2ak2","Ifi27","Bst2"
)
GS_GOBP_NEG_REG_CHEMOKINE_PRODUCTION <- c(
"Oas3","Nr1h4","Oas1g","Oas1a","Lgals9","Epha2","Erbin","Mul1","F2rl1"
)
GS_GOBP_NEG_REG_INNATE_IMMUNE <- c(
"Oas3","Oas1g","Isg15","Ceacam1","Oas1a","Arg1","Dhx58","Pparg","Nlrx1",
"Parp14","H2-Q2","H2-T23","Nr1h3","Trafd1","Nmi","Lgals9","Tnfaip3",
"H2-D1","Stat2","Samhd1","Crk","Mul1","Mill2"
)
GS_GOBP_NEG_REG_IMMUNE_RESPONSE <- c(
"Oas3","Cd55","Oas1g","Lgals3","Isg15","Ceacam1","Oas1a","Arg1","Dhx58",
"Syk","Pparg","Nlrx1","Parp14","Nlrp6","H2-Q2","Anxa1","H2-T23","Nr1h3",
"Trafd1","Nmi","Zbtb7b","Parp3","Lgals9","Tnfaip3","H2-D1","Zc3h12a",
"Stat2","Samhd1","Il4ra","Crk","Gpx1","Mul1","Mill2","H2-Eb1"
)
GS_HALLMARK_EMT <- c(
"Igfbp2","Lama1","Mmp2","Wnt5a","Pcolce","Fstl1","Vcan","Sparc","Pfn2",
"Serpinh1","Cdh11","Lgals1","Bgn","Plod2","Pthlh","Cdh2","Nid2","Emp3",
"Bdnf","Igfbp3","Gja1","Fermt2","Col3a1","Gas1","Thbs2","Serpine2","Mgp",
"Igfbp4","Col7a1","Anpep","Fbln1","Plod1","Gem","Dpysl3","P3h1","Fbln2",
"Efemp2","Tgfbi","Col16a1","Gpc1","Timp1","Gadd45a","Col4a1","Col4a2",
"Vim","Cxcl1","Sntb1","Ecm1","Ccn2","Itgb1","Tgfb1","Calu"
)
GS_HALLMARK_ANGIOGENESIS <- c(
"Fstl1","Thbd","Vcan","S100a4","Fgfr1","Stc1","Col3a1","Timp1"
)
GS_GOBP_NEG_REG_WNT_PATHWAY <- c(
"Igfbp6","Igfbp2","Wnt5a","Sox2","Nog","Nkd2","Cdh2","Apcdd1","Dact1",
"Gli3","Wnt11","Cited1","Wif1","Cav1","Mdk","Grb10","Bmp2","Igfbp4",
"Notum","Fzd1","Tpbg","Nfatc4","Ptpru","Fgf9","Dkk3","Nkd1","Kremen1",
"Tle1","Nxn","Wwtr1","Fzd6","Rack1","Tle2"
)
GS_GOBP_POS_REG_STEM_CELL_PROLIFERATION <- c(
"Gja1","Cited1","Sox11","Fermt2","Hmga2","Ltbp3","Pdcd2"
)
GS_HALLMARK_OXPHOS <- c(
"Maob","Acat1","Casp7","Suclg1","Mgst3","Cox10","Retsat","Echs1","Decr1",
"Cpt1a","Aldh6a1","Slc25a5","Ech1","Acaa2","Hadhb","Cox5a","Mpc1","Ndufb5",
"Cox6c","Nqo2","Uqcrfs1","Ndufa5","Cox5b","Uqcrc2","Acadm","Cox7b","Phyh",
"Cox6a1","Sdhc","Ndufa1","Sdhd","Atp5o","Etfa","Uqcrc1","Atp1b1","Uqcrb",
"Hadha","Cox4i1","Cyc1","Ndufb8","Atp5c1","Sdha","Uqcrh","Cox8a","Mdh1",
"Ndufa9","Atp5g1","Uqcr10","Acadvl","Aco2","Sdhb","Cox7a2","Atp5a1","Fh1",
"Idh3b","Ndufc1","Ndufv1","Etfb","Surf1","Slc25a12","Ndufs7","Ndufv2",
"Cox7c","Atp5k","Idh3g","Atp5g3","Ndufs1","Atp5l","Slc25a20","Ndufa3","Ogdh",
"Idh3a","Atp5d","Vdac3","Atp5h","Ndufb4","Atp5b","Prdx3","Ndufs2","Pdha1",
"Ndufa8","Ndufa6","Cox6b1","Cs","Ndufab1","Ndufs3","Uqcr11","Ndufs6","Ndufb6",
"Cox7a2l","Atp5j2","Slc25a3","Atp5j","Vdac1","Ndufb2","Atp5g2","Alas1",
"Slc25a11","Vdac2","Etfdh","Ndufs8","Rhot2","Eci1","Isca1","Dld","Dlat",
"Hsd17b10","Ndufb3","Oxa1l","Uqcrq","Ndufs4","Acaa1a","Oat","Mtrf1","Mfn2",
"Cycs","Ndufa7","Immt","Aifm1","Atp5e","Afg3l2","Ndufb7","Fdx1","Mdh2"
)
GS_GOBP_MAINTENANCE_GI_EPITHELIUM <- c("Muc2","Tff3","Muc4","Rbp4","Muc13")
GS_GOBP_INTESTINAL_ABSORPTION <- c(
"Fabp1","Apoa4","Apoa1","Fabp2","Prap1","Npc1l1","Slc26a6","Mogat2","Cd36",
"Abcg8","Slc2a5","Abcg5","Soat2","Slc5a1","Gcnt3","Vdr","Vil1"
)
GS_GOBP_INTESTINAL_LIPID_ABSORPTION <- c(
"Apoa4","Apoa1","Fabp2","Prap1","Npc1l1","Cd36","Abcg8","Abcg5","Soat2"
)
GS_GOBP_REG_INTESTINAL_ABSORPTION <- c(
"Apoa4","Apoa1","Prap1","Abcg8","Abcg5","Lpcat3"
)
GS_GOBP_INNATE_IMMUNE_RESPONSE_MUCOSA <- c("Apoa4","Rpl39","Defa24")
GS_GOBP_INNATE_IMMUNE_RESPONSE <- c(
"Rsad2","Ifit2","Ifit1","Ifit3","Isg15","Apol9a","Gbp3","Gbp2","Gbp4",
"Ddx60","Zbp1","Bst2","Oas3","Oasl1","Oas2","Dhx58","Usp18","Gbp7","Nlrc5",
"Cxcl10","Oas1g","Irgm1","H2-Q7","Irf7","Stat2","Sp100","Oas1a","Stat1",
"Ifih1","Parp14","Znfx1","Gbp9","Irgm2","Trim21","Lgals9","Dtx3l","Trim34a",
"Ifi35","Ifit1bl1","C2","Parp9","Lcn2","Eif2ak2","Shfl","Trim12c","Nmi",
"H2-Q10","Tlr3","H2-T23","Ifitm3","Cd74","Trafd1","Isg20","Adar","Cfb",
"H2-D1","Samhd1","Trim25","Pml","Ltf","Zc3hav1","C4bp","Adam8","Padi4",
"H2-M3","Grn","Cfi","Nos2","Trim56","Fes","Socs1","Trim26","Csf1","Casp4",
"Vav1","Lacc1","Ifi27","Arrb2","Rnf135","Cyba","Tmem106a","Lbp","Trim14",
"Irf1","Mif","Ifitm1","Fcer1g","Pla2g2f","Cxcl16","Tnf","C3","Tyrobp",
"N4bp1","Ifnar2","Map4k2","Apobec3","Igha","H2-Q2","Myc","Ikbke","Ifitm2",
"Tlr2","Nub1","Cd14","Unc13d","Arid5a","Gsdmd","Unc93b1"
)
GS_HALLMARK_COAGULATION <- c(
"Clu","Plat","Htra1","Ctse","Anxa1","Mmp7","Timp1","Pdgfb","Ctsh","F3",
"Prss23","Fn1","Trf","Hpn","Gda","Itga2","Hmgcs2","Lamp2","Mmp15","Csrp1",
"Cfi","P2ry1","Mmp14"
)
GS_HALLMARK_ANGIOGENESIS_ALT <- c(
"Spp1","Cxcl5","Vav2","Timp1","Pglyrp1","Itgav","Ccnd2","Tnfrsf21","S100a4"
)
GS_HALLMARK_APICAL_SURFACE <- c(
"Ntng1","Pkhd1","Atp6v0a4","Gstm5","Sulf2","Adam10","Crocc"
)
## ============================================================
## 2) BUILD SIGNATURE LIST FOR AddModuleScore
## (names without _clXXX)
## ============================================================
sig_list <- list(
# core epithelial/tumor programs
WNTon = wnt_on_genes_mouse,
Lgr5 = lgr5_genes_mouse,
coreHRC = coreHRC_mouse,
Oncofetal = oncofetal_genes,
FetalMustata = fetal_mustata,
EpiHR = EpiHR_mouse,
YAP = YAP_mouse,
Ki67 = Ki67_mouse,
KRASup = KRASup_mouse,
# WNT / TNF / EMT / KRAS / IFN / innate / etc.
H_WNT_BETA = GS_HALLMARK_WNT_BETA_CATENIN_SIGNALING,
H_TNFA_NFKB = GS_HALLMARK_TNFA_SIGNALING_VIA_NFKB,
H_INFLAM = GS_HALLMARK_INFLAMMATORY_RESPONSE,
EMT = GS_HALLMARK_EMT,
KRAS_UP = GS_HALLMARK_KRAS_SIGNALING_UP,
TNFA_NFKB_2 = GS_HALLMARK_TNFA_SIGNALING_VIA_NFKB_ALT,
ANGIO = GS_HALLMARK_ANGIOGENESIS,
NEG_WNT = GS_GOBP_NEG_REG_WNT_PATHWAY,
STEM_PROL = GS_GOBP_POS_REG_STEM_CELL_PROLIFERATION,
IFNg = GS_HALLMARK_INTERFERON_GAMMA_RESPONSE,
IFNa = GS_HALLMARK_INTERFERON_ALPHA_RESPONSE,
NEG_CHEMOKINE = GS_GOBP_NEG_REG_CHEMOKINE_PRODUCTION,
NEG_INNATE = GS_GOBP_NEG_REG_INNATE_IMMUNE,
NEG_IMMUNE = GS_GOBP_NEG_REG_IMMUNE_RESPONSE,
INNATE = GS_GOBP_INNATE_IMMUNE_RESPONSE,
OXPHOS = GS_HALLMARK_OXPHOS,
MAINT_GI = GS_GOBP_MAINTENANCE_GI_EPITHELIUM,
INT_ABS = GS_GOBP_INTESTINAL_ABSORPTION,
INT_LIP_ABS = GS_GOBP_INTESTINAL_LIPID_ABSORPTION,
REG_INT_ABS = GS_GOBP_REG_INTESTINAL_ABSORPTION,
MUCOSA_INNATE = GS_GOBP_INNATE_IMMUNE_RESPONSE_MUCOSA,
COAG = GS_HALLMARK_COAGULATION,
ANGIO_ALT = GS_HALLMARK_ANGIOGENESIS_ALT,
APICAL_SURF = GS_HALLMARK_APICAL_SURFACE
)
# Keep only genes present in the Seurat object (avoids empty/missing features)
sig_list <- lapply(sig_list, function(gs) intersect(unique(gs), rownames(tumcells)))
## ============================================================
## 3) ADD MODULE SCORES TO tumcells
## Creates meta.data columns named: MS_<signature>_1
## ============================================================
for (nm in names(sig_list)) {
tumcells <- AddModuleScore(
tumcells,
features = list(sig_list[[nm]]),
name = paste0("MS_", nm, "_")
)
}
## ============================================================
## 4) CORRELATION: each signature vs. cluster membership
## - Build one-vs-rest cluster indicator matrix
## - Correlate indicator (0/1) with continuous module score
## ============================================================
meta_df <- tumcells@meta.data
meta_df$cluster <- as.character(tumcells$seurat_clusters)
# all module-score columns
score_cols <- grep("^MS_", colnames(meta_df), value = TRUE)
clusters_vec <- meta_df$cluster
cluster_levels <- unique(clusters_vec)
# binary matrix: columns = clusters; rows = cells; 1 if cell belongs to cluster
cluster_mat <- sapply(cluster_levels, function(cl) as.numeric(clusters_vec == cl))
colnames(cluster_mat) <- cluster_levels
# compute Pearson correlation(signature score, cluster membership)
cor_list <- lapply(score_cols, function(col) {
sig_values <- meta_df[[col]]
cors <- apply(cluster_mat, 2, function(x) {
cor(x, sig_values, use = "pairwise.complete.obs")
})
data.frame(
score_col = col,
cluster = names(cors),
correlation = cors,
stringsAsFactors = FALSE
)
})
cor_df <- do.call(rbind, cor_list)
# Extract clean signature name from AddModuleScore output column
cor_df$Signature <- gsub("^MS_|_1$", "", cor_df$score_col)
# Working plotting table (will be filtered/ordered below)
plot_df <- cor_df
## ============================================================
## 5) COLLAPSE duplicate gene sets + order by best-associated cluster
## ============================================================
# Remove NAs (defensive)
plot_df <- subset(plot_df, !is.na(Signature))
cor_df <- subset(cor_df, !is.na(Signature))
# Map signatures to base pathway names (for collapsing related signatures)
plot_df$Pathway <- base_name_fun(plot_df$Signature)
cor_df$Pathway <- base_name_fun(cor_df$Signature)
# Identify best cluster per Signature
tmp <- cor_df
tmp$cnum <- as.numeric(as.character(tmp$cluster))
sig_stats <- do.call(
rbind,
lapply(split(tmp, tmp$Signature), function(df) {
if (all(is.na(df$correlation))) {
return(data.frame(Signature=df$Signature[1], best_cluster=Inf, best_cor=NA_real_))
}
i <- which.max(df$correlation)
data.frame(
Signature = df$Signature[i],
best_cluster = df$cnum[i],
best_cor = df$correlation[i]
)
})
)
sig_stats$Pathway <- base_name_fun(sig_stats$Signature)
# Keep strongest Signature per Pathway
keep_sig <- unlist(
lapply(split(sig_stats, sig_stats$Pathway), function(df) {
df$Signature[which.max(df$best_cor)]
})
)
plot_df <- subset(plot_df, Signature %in% keep_sig)
cor_df <- subset(cor_df, Signature %in% keep_sig)
# Collapse names to base pathway (after filtering)
plot_df$Signature <- base_name_fun(plot_df$Signature)
cor_df$Signature <- base_name_fun(cor_df$Signature)
# Recompute best cluster per Pathway (for x-axis ordering)
tmp2 <- cor_df
tmp2$cnum <- as.numeric(as.character(tmp2$cluster))
path_stats <- do.call(
rbind,
lapply(split(tmp2, tmp2$Signature), function(df) {
if (all(is.na(df$correlation))) {
return(data.frame(Signature=df$Signature[1], best_cluster=Inf, best_cor=NA_real_))
}
i <- which.max(df$correlation)
data.frame(
Signature = df$Signature[i],
best_cluster = df$cnum[i],
best_cor = df$correlation[i]
)
})
)
path_stats <- path_stats[order(path_stats$best_cluster, -path_stats$best_cor), ]
sig_order <- path_stats$Signature
sig_order <- sig_order[sig_order %in% plot_df$Signature]
plot_df$Signature <- factor(plot_df$Signature, levels = sig_order)
## ============================================================
## 6) CLUSTER LABELS FROM tumcells$tumcells_annotation (0 at top)
## ============================================================
meta_all <- tumcells@meta.data
# Map numeric seurat cluster IDs to biological state labels
cl_map <- unique(meta_all[, c("seurat_clusters", "tumcells_annotation")])
cl_map <- cl_map[!is.na(cl_map$tumcells_annotation), ]
cl_map$seurat_clusters <- as.character(cl_map$seurat_clusters)
cl_num <- sort(unique(as.numeric(cl_map$seurat_clusters)))
cluster_levels <- rev(as.character(cl_num)) # 0 on top
lab_vec <- cl_map$tumcells_annotation[
match(as.character(cl_num), cl_map$seurat_clusters)
]
names(lab_vec) <- as.character(cl_num)
# Apply cluster order and create label vector in plotting order
plot_df$cluster <- factor(as.character(plot_df$cluster), levels = cluster_levels)
cluster_labels <- lab_vec[levels(plot_df$cluster)]
## ============================================================
## 7) NICE NAMES FOR GENE SETS (x-axis labels)
## ============================================================
nice_names <- c(
WNTon = "WNT-on stemness",
Lgr5 = "Lgr5+ stem cells",
coreHRC = "core HRC program",
Oncofetal = "Oncofetal",
FetalMustata = "Fetal epithelium",
EpiHR = "Epi-HR",
YAP = "YAP/TAZ targets",
Ki67 = "Proliferation",
KRASup = "KRAS-up",
H_WNT_BETA = "Hallmark WNT/β-cat.",
EMT = "EMT",
H_TNFA_NFKB = "TNFα–NF-κB",
H_INFLAM = "Inflamm. response",
NEG_WNT = "Negative WNT reg.",
STEM_PROL = "Stem cell prolifer.",
IFNg = "IFN-γ response",
IFNa = "IFN-α response",
INNATE = "Innate immune resp.",
NEG_CHEMOKINE = "Neg. chemokine reg.",
NEG_INNATE = "Neg. innate reg.",
NEG_IMMUNE = "Neg. immune reg.",
OXPHOS = "Oxidative phosph.",
MAINT_GI = "GI epithelium maint.",
INT_ABS = "Intestinal absorp.",
INT_LIP_ABS = "Lipid absorption",
REG_INT_ABS = "Reg. intestinal abs.",
REG_INT_LIP_ABS = "Reg. lipid abs.",
INNATE_MUCOSA8 = "Innate mucosa (Paneth)",
INNATE_MUCOSA11 = "Innate mucosa (Apoa4)",
COAG = "Coagulation",
ANGIO = "Angiogenesis",
APICAL_SURF = "Apical surface"
)
x_levels <- levels(plot_df$Signature)
x_labs <- setNames(x_levels, x_levels)
x_labs[names(nice_names)] <- nice_names[names(nice_names) %in% x_levels]
## ============================================================
## 8) GRID DATA FOR geom_tile (background squares)
## ============================================================
grid_df <- expand.grid(
Signature = levels(plot_df$Signature),
cluster = levels(plot_df$cluster)
)
## ============================================================
## 9) FINAL GRID BUBBLE PLOT
## ============================================================
p <- ggplot() +
# square grid background
geom_tile(
data = grid_df,
aes(x = Signature, y = cluster),
color = "grey85",
fill = NA,
linewidth = 0.4
) +
# bubbles encode correlation magnitude and sign
geom_point(
data = plot_df,
aes(
x = Signature,
y = cluster,
size = correlation,
fill = correlation
),
shape = 21,
stroke = 0.8,
color = "black"
) +
# blue → white → red correlation gradient
scale_fill_gradientn(
colours = c(
"#2166AC", # dark blue
"#67A9CF", # light blue
"#F7F7F7", # white
"#D64545", # light red/orange
"darkred" # strong red
),
values = rescale(c(-1, -0.5, 0, 0.5, 1)),
limits = c(-1, 1),
breaks = c(-1, 0, 1),
labels = c("-1", "0", "1"),
name = "Pearson correlation"
) +
scale_size(range = c(1, 5), name = "Correlation") +
scale_y_discrete(labels = cluster_labels) +
scale_x_discrete(labels = x_labs) +
theme_bw(base_size = 12) +
theme(
panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 8, face = "bold"),
plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
legend.text = element_text(size = 8),
legend.title = element_text(size = 10, face = "bold")
) +
labs(
title = "Signature × tumor cell state landscape\n(correlation-based score)",
y = "Tumor cell state"
)
print(p)This notebook provides the complete, fully reproducible analytical workflow underlying Figure 3, from raw single-cell state annotation and pathway scoring to compositional analysis and lineage reconstruction. All steps are performed directly on the annotated tumor epithelial Seurat object, with consistent color schemes, state ordering, and model hierarchies enforced across panels to ensure interpretability and cross-panel comparability. The combined visualization of discrete cell states, continuous gene-program activity, and Slingshot-inferred trajectories establishes a coherent framework to relate genotype, transcriptional plasticity, and developmental progression within the tumor epithelium. This framework can be readily extended to additional gene programs, perturbations, or datasets, and serves as a reference pipeline for state-resolved, trajectory-aware analysis of epithelial tumor ecosystems.